This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Re-use work of getting state for a given state_group (_get_state_groups_from_groups) #15617

Conversation

@MadLittleMods (Contributor) commented May 17, 2023

Re-use the work of getting state for a given state_group. This is useful on the _compute_event_context_with_maybe_missing_prevs(...) -> _get_state_groups_from_groups(...) path.

Part of making /messages faster: #13356

Background (explaining the situation before this PR)

When we have to _compute_event_context_with_maybe_missing_prevs, _get_state_groups_from_groups currently takes up most of the time. For each given state group, it makes the following recursive query, which walks up the entire state_group chain and returns the complete distinct state, which means thousands of membership events at the very least. For example, for a random /messages request in the Matrix HQ room, we made 10 of these queries, each taking 0.5-2 seconds and returning roughly 88k events, totaling 14s.

WITH RECURSIVE sgs(state_group) AS (
  VALUES 
    (? :: bigint) 
  UNION ALL 
  SELECT 
    prev_state_group 
  FROM 
    state_group_edges e, sgs s 
  WHERE 
    s.state_group = e.state_group
) 
SELECT 
  DISTINCT ON (type, state_key) type, state_key, event_id 
FROM 
  state_groups_state 
WHERE 
  state_group IN (
    SELECT state_group FROM sgs
  ) 
ORDER BY type, state_key, state_group DESC

https://explain.depesz.com/s/OuOe
https://explain.dalibo.com/plan/ef6bc290f2dced17

EXPLAIN ANALYZE query output
EXPLAIN ANALYZE WITH RECURSIVE sgs(state_group) AS (
  VALUES 
    (739988088 :: bigint) 
  UNION ALL 
  SELECT 
    prev_state_group 
  FROM 
    state_group_edges e, sgs s 
  WHERE 
    s.state_group = e.state_group
) 
SELECT 
  DISTINCT ON (type, state_key) type, state_key, event_id 
FROM 
  state_groups_state 
WHERE 
  state_group IN (
    SELECT state_group FROM sgs
  ) 
ORDER BY type, state_key, state_group DESC;
                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=7161.42..7199.29 rows=5050 width=91) (actual time=1084.809..1099.816 rows=88180 loops=1)
   CTE sgs
     ->  Recursive Union  (cost=0.00..284.28 rows=101 width=8) (actual time=0.004..21.247 rows=59 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
           ->  Nested Loop  (cost=0.57..28.23 rows=10 width=8) (actual time=0.358..0.359 rows=1 loops=59)
                 ->  WorkTable Scan on sgs s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.000 rows=1 loops=59)
                 ->  Index Only Scan using state_group_edges_unique_idx on state_group_edges e  (cost=0.57..2.79 rows=1 width=16) (actual time=0.357..0.357 rows=1 loops=59)
                       Index Cond: (state_group = s.state_group)
                       Heap Fetches: 0
   ->  Sort  (cost=6877.14..6889.76 rows=5050 width=91) (actual time=1084.806..1089.100 rows=88221 loops=1)
         Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group DESC
         Sort Method: quicksort  Memory: 14782kB
         ->  Nested Loop  (cost=3.11..6566.51 rows=5050 width=91) (actual time=23.522..890.988 rows=88221 loops=1)
               ->  HashAggregate  (cost=2.27..3.28 rows=101 width=8) (actual time=21.348..21.383 rows=59 loops=1)
                     Group Key: sgs.state_group
                     Batches: 1  Memory Usage: 24kB
                     ->  CTE Scan on sgs  (cost=0.00..2.02 rows=101 width=8) (actual time=0.007..21.282 rows=59 loops=1)
               ->  Index Scan using state_groups_state_type_idx on state_groups_state  (cost=0.84..64.48 rows=50 width=91) (actual time=0.959..14.462 rows=1495 loops=59)
                     Index Cond: (state_group = sgs.state_group)
 Planning Time: 4.485 ms
 Execution Time: 1105.539 ms
(21 rows)

Time: 1112.886 ms (00:01.113)

Full Jaeger trace JSON

Before

Say we have state_groups 1-10, where state_group 1 is the m.room.create and each subsequent group builds on the previous one.

Currently, when you ask for the state at state_groups = [5, 8], we fetch state from 5 -> 1 and from 8 -> 1, which is pretty wasteful and time consuming because we're mostly fetching the same thing twice (imagine this in Matrix HQ, where each chain returns roughly 88k events).

flowchart RL
  10 --> 9 --> 8 --> 7 --> 6 --> 5 --> 4 --> 3 --> 2 --> 1

After

Now, when you ask for the state at state_groups = [5, 8], we fetch 5 -> 1 as normal, but then only fetch 8 -> 5 and layer it on top of the work we already did for state_group = 5.

In terms of query performance, our first query will take the same amount of time as before, but any subsequent state_groups that share part of the same chain will take significantly less time.
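
To make the layering concrete, here is a minimal in-memory sketch of the re-use idea (not the actual Synapse code; the edges/deltas dicts are stand-ins for the state_group_edges and state_groups_state tables):

# Minimal in-memory sketch of the re-use idea, NOT the actual Synapse code.
# `edges` maps state_group -> prev_state_group and `deltas` maps
# state_group -> {(type, state_key): event_id}.
from typing import Dict, Optional, Tuple

StateKey = Tuple[str, str]

edges: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}
deltas: Dict[int, Dict[StateKey, str]] = {g: {("m.test", str(g)): f"$event{g}"} for g in edges}
deltas[1][("m.room.create", "")] = "$create"


def get_state_for_groups(groups):
    """Resolve state for each group, re-using chains we already walked."""
    results: Dict[int, Dict[StateKey, str]] = {}
    already_fetched = set()

    for group in groups:
        # Walk up the chain, stopping early at a group we already resolved.
        chain, base, current = [], None, group
        while current is not None:
            if current in already_fetched:
                base = current
                break
            chain.append(current)
            current = edges[current]

        # Start from the previously computed state (if any) and layer the
        # deltas on top, oldest first, so newer entries win.
        state = dict(results[base]) if base is not None else {}
        for g in reversed(chain):
            state.update(deltas[g])

        results[group] = state
        already_fetched.add(group)

    return results


# 8 only walks 8 -> 7 -> 6 and then re-uses the state computed for 5.
print(sorted(get_state_for_groups([5, 8])[8].items())[:3])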

In the sample request above, simulating how the app would make the queries with the changes in this PR, we don't actually see much benefit because it turns out that there isn't much state_group sharing amongst events. Those 14s of SQL queries are just fetching the state at each of the prev_events of one event, and only one of them seems to share state_groups.

state_group chains not actually shared that much in that 14s database query 😬
flowchart BT
    subgraph $LrOm7aODe5StbJwD9tFAfrTZ1ClzWJLkIUsMH1BMc-Y
        direction BT
        736242551 --> 736244167 --> 736258375 --> 736271166 --> 736272133 --> 736273053 -->
        736279599 --> 736283383 --> 736285535 --> 736306125 --> 736307774 --> 736315167 -->
        736320311 --> 736321249 --> 736329091 --> 736332354 --> 736349788 --> 736355933 -->
        736356577 --> 736359340 --> 736361356 --> 736364055 --> 736364651 --> 736364778 -->
        736366361 --> 736374000 --> 736374226 --> 736376011 --> 736380262 --> 736381521 -->
        736390487 --> 736392574 --> 736394358 --> 736394537 --> 736402720 --> 736411655 -->
        736412991 --> 736414800 --> 736417133 --> 736417721 --> 736424434 --> 736425468 -->
        736426289 --> 736430049 --> 736430450 --> 736433258 --> 736437043 --> 736438637 -->
        736439262 --> 736439384
    end

    subgraph $06merKAyQAOg4R5KSvvwq-aY8YxmLxBYKJuYh8S2oFg
        direction BT
        738551430 --> 738551567 --> 738551631 --> 738551632 --> 738554279 --> 738555253 -->
        738556945 --> 738563579 --> 738564700 --> 738567178 --> 738567279 --> 738567972 -->
        738568036 --> 738568218 --> 738568518 --> 738569471 --> 738570430 --> 738571084 -->
        738574267 --> 738574570 --> 738574739 --> 738576792 --> 738577181 --> 738579766 -->
        738579767
    end

    subgraph $OXO_pg-0sMtWmY23TeA3XWacwe-rtrtRPsCuuC5nI9E
        direction BT
        739921430 --> 739923253 --> 739926162
    end

    subgraph $d_C3gK6O9cHcEkL4wb1ixzgc5I2Z5YYCaaP4PkS2w8Q
        direction BT
        747450250 --> 747454137 --> 747454497 --> 747457549 --> 747458705 --> 747459083 -->
        747460418 --> 747460615 --> 747460879 --> 747460907 --> 747462327 --> 747463184 -->
        747463210 --> 747463728 --> 747471813 --> 747472497 --> 747472883 --> 747473622 -->
        747474075 --> 747475361 --> 747476403 --> 747477315 --> 747478191 --> 747478527 -->
        747479040 --> 747479405 --> 747487181 --> 747487318 --> 747490575 --> 747490649 -->
        747491096 --> 747493955 --> 747494115 --> 747495543 --> 747496126 --> 747496673 -->
        747500245 --> 747501014 --> 747502386 --> 747503796 --> 747504895 --> 747505981 -->
        747507859 --> 747508081 --> 747508938 --> 747511840 --> 747511898 --> 747511899 -->
        747512157 --> 747512397 --> 747514214 --> 747514428 --> 747515134 --> 747515135 -->
        747515473 --> 747515813 --> 747515814 --> 747517299 --> 747517409 --> 747517726 -->
        747517888 --> 747518621 --> 747522938 --> 747522939 --> 747524253 --> 747524349 -->
        747531289 --> 747532433 --> 747537361 --> 747538391 --> 747538672 --> 747538876 -->
        747539259 --> 747539318 --> 747539721 --> 747540484 --> 747540957 --> 747540985 -->
        747541741 --> 747541853 --> 747543282 --> 747543429 --> 747544548 --> 747545428 -->
        747545527 --> 747546551 --> 747546629 --> 747546765 --> 747546878 --> 747546913 -->
        747547518 --> 747548528 --> 747548884 --> 747549000
    end

    subgraph  $R4RlkxDq6my3GMq4P4kEoSgpZPcFxRm_YZGOZcf7KHk
        direction BT
        747894355 --> 747894387 --> 747894392 --> 747894435 --> 747894456 --> 747894481 -->
        747894520 --> 747894566 --> 747894582 --> 747894595 --> 747894607 --> 747894613 -->
        747894654 --> 747894696 --> 747894728 --> 747894756 --> 747894790
    end

    subgraph $HeKxcdhKnOl-ocKIXEHOzHsOvkL7B-cjTFcw9RHfeeY
        direction BT
        748155154 --> 748155201 --> 748155202 --> 748155210 --> 748156064 --> 748156371 -->
        748157020 --> 748157216 --> 748159224 --> 748161119 --> 748161155 --> 748161391 -->
        748161847 --> 748167002 --> 748167718 --> 748172461 --> 748172864 --> 748172962 -->
        748173425 --> 748174163 --> 748174259 --> 748175893 --> 748176100 --> 748177162 -->
        748178903
    end

    subgraph $1xVp4R7Wh8jldeVFvdNoNi0kV9Mp_zPDrnQ9rESs9Vw
        direction BT
        749290472 --> 749291708 --> 749291709 --> 749292485 --> 749292486 --> 749293705 -->
        749293836 --> 749293941 --> 749294167 --> 749294279 --> 749294292 --> 749294295 -->
        749294297 --> 749294324 --> 749294350
    end

    subgraph $hYIHVcmQoxy_VUIk8GpKWynuSQr_n1A-NHGdwh5Qn3U
        direction BT
        %%749290472 --> 749291708 --> 749291709 --> 749292485 --> 749292486 --> 749293705 -->
        %%749293836 --> 749293941 --> 749294167 --> 749294279 --> 749294292 --> 749294295 -->
        749294297 --> 749294496 --> 749294565 --> 749294632 --> 749294668 --> 749294692 -->
        749294733 --> 749294882
    end

    subgraph $CskkFIQnNzFMhfgaqOijXmSDNcJ1VyEzDCyYlQwN934
        direction BT
        745396521 --> 745396885 --> 745397790 --> 745398057 --> 745400234 --> 745401111 -->
        745401737 --> 745401872 --> 745404304 --> 745404475 --> 745406724 --> 745409854 -->
        745410377 --> 745410535 --> 745413344 --> 745417326 --> 745420130 --> 745420131 -->
        745421885 --> 745422806 --> 745423670 --> 745435724 --> 745438992 --> 745439103 -->
        745441593 --> 745441594 --> 745442004 --> 745442437 --> 745443725 --> 745444001 -->
        745447038 --> 745447167 --> 745447479 --> 745450805 --> 745450841 --> 745454335 -->
        776857245
    end

Initially, I thought the lack of sharing was quite strange, but this is because of the state_group snapshotting feature: if an event requires more than 100 hops on state_group_edges, then Synapse will create a new state_group with a snapshot of all of that state. It seems like this isn't done very efficiently though. Relevant docs.
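
For reference, here is a hedged sketch of that snapshotting rule (the helper name and in-memory edges mapping are made up for illustration; the real check lives in Synapse's state store around a ~100 hop threshold):

# Hedged sketch of the snapshotting rule described above; the helper name and
# the in-memory `edges` mapping are hypothetical, and the threshold is the
# ~100 hop limit mentioned in the docs.
MAX_STATE_DELTA_HOPS = 100


def should_store_full_snapshot(state_group: int, edges: dict) -> bool:
    """Return True if the delta chain behind `state_group` is already so long
    that a new state_group should store the full state instead of a delta."""
    hops = 0
    current = edges.get(state_group)
    while current is not None:
        hops += 1
        if hops >= MAX_STATE_DELTA_HOPS:
            return True
        current = edges.get(current)
    return False


# A chain of 150 groups (2 -> 1, 3 -> 2, ...) exceeds the hop limit.
print(should_store_full_snapshot(150, {n: n - 1 for n in range(2, 151)}))  # True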

It also turns out the event is an org.matrix.dummy_event, which Synapse automatically puts in the DAG to resolve outstanding forward extremities. These events aren't even shown to clients, so we don't need to waste time waiting for them to backfill. Tracked by #15632

Generally, I think this PR could bring great gains in conjunction with running some sort of state compressor over the database to get a lot more sharing, and with making the online state_group snapshotting logic smarter. I don't know how the existing state compressors work, but I imagine we could create snapshots bucketed by years -> months -> weeks -> days -> hours -> individual events and build new state_group chains that use these from biggest to smallest to get maximal sharing (see the sketch below).
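
Purely as an illustration of that bucketing idea (nothing like this exists in this PR or in Synapse; the levels and snapshot positions are invented):

# Illustrative-only sketch of the hierarchical snapshot idea: pick the closest
# pre-built snapshot at or below a state_group so the remaining delta chain is
# as short as possible. The levels and snapshot positions are invented.
import bisect

SNAPSHOT_LEVELS = {
    # level -> sorted state_group ids at which a full snapshot was taken
    "year": [100_000, 200_000, 300_000],
    "month": [110_000, 120_000, 130_000, 210_000],
    "day": [111_000, 112_000, 121_000],
}


def nearest_snapshot(state_group: int):
    """Return (level, snapshot_group) for the closest snapshot at or below
    `state_group`, across all levels."""
    best = None
    for level, groups in SNAPSHOT_LEVELS.items():
        idx = bisect.bisect_right(groups, state_group) - 1
        if idx >= 0 and (best is None or groups[idx] > best[1]):
            best = (level, groups[idx])
    return best


print(nearest_snapshot(121_500))  # ('day', 121000)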

Dev notes


WITH RECURSIVE sgs(state_group, state_group_reached) AS (
  VALUES 
    (739988088::bigint, NULL::bigint)
  UNION ALL 
  SELECT 
    prev_state_group,
    CASE
      /* Specify state_groups we have already done the work for */
      WHEN prev_state_group = 739965897::bigint OR prev_state_group = 739956035::bigint THEN prev_state_group
      ELSE NULL
    END AS state_group_reached
  FROM 
    state_group_edges e, sgs s 
  WHERE 
    s.state_group = e.state_group
    /* Stop when we connect up to another state_group that we already did the work for */
    AND s.state_group_reached IS NULL
)
(
    SELECT NULL, NULL, NULL, state_group_reached
    FROM sgs
    ORDER BY state_group ASC
    LIMIT 1
) UNION ALL (
    SELECT 
      DISTINCT ON (type, state_key) type, state_key, event_id, state_group
    FROM 
      state_groups_state
    /*INNER JOIN sgs USING (state_group)*/
    WHERE 
      state_group IN (
        SELECT state_group FROM sgs
      ) 
    ORDER BY type, state_key, state_group DESC
);
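
A hedged sketch of how a caller might consume that result shape, assuming the sentinel row from the first branch of the UNION ALL comes back first (txn is a DB-API cursor and previously_fetched is a hypothetical dict of state we already resolved; neither is Synapse's actual code):

# Hedged sketch only: consume the UNION ALL result above.
def read_partial_state(txn, previously_fetched):
    rows = txn.fetchall()

    # Sentinel row from the first branch: (NULL, NULL, NULL, state_group_reached)
    _, _, _, state_group_reached = rows[0]

    # Remaining rows: the (type, state_key) -> event_id delta we walked.
    partial_state_map = {
        (typ, state_key): event_id
        for typ, state_key, event_id, _ in rows[1:]
    }

    if state_group_reached is None:
        # We walked all the way to the root; this is the complete state.
        return partial_state_map

    # Layer the delta on top of the base group's already-fetched state;
    # entries from the delta win because they are newer.
    state = dict(previously_fetched[state_group_reached])
    state.update(partial_state_map)
    return state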

_compute_event_context_with_maybe_missing_prevs
compute_event_context
calculate_context_info
resolve_state_groups_for_events(prev_events)
get_state_for_groups
_get_state_for_groups
_get_state_groups_from_groups
_get_state_groups_from_groups_txn

SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO python -m twisted.trial tests.test_visibility.FilterEventsForServerTestCase.test_filtering

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

@MadLittleMods MadLittleMods added A-Federation T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) labels May 17, 2023
results[group] = partial_state_map_for_state_group

state_groups_we_have_already_fetched.add(group)

else:
MadLittleMods (Contributor, Author) commented:
To not complicate the diff, I've held off on applying the same treatment to SQLite.

We can iterate on this in another PR, or just have people use Postgres in order to see the performance benefit.

results[group] = partial_state_map_for_state_group

state_groups_we_have_already_fetched.add(group)

@MadLittleMods (Contributor, Author) commented May 18, 2023:
[Quotes the "After" section of the PR description above, including the observation that the sample request doesn't see much benefit because there isn't much state_group sharing amongst events, the explanation of state_group snapshotting, and the org.matrix.dummy_event note.]

This PR hasn't made as big of an impact as I thought it would for that type of request. Are we still interested in a change like this? It may work well for sequential events that we backfill.

It seems like our state_group sharing is really sub-par, and the fact that state_group chains can only have a max of 100 hops puts an upper limit on how much gain this PR can give. I didn't anticipate that's how state_groups worked; I thought it was one state_group per state change, which it is until it starts doing snapshots.

Maybe it's more interesting to first make our state_group logic much smarter and then re-visit something like this. Or look into the state compressor stuff to optimize our backlog, which would help the Matrix Public Archive. I'm not sure whether the current state compressors optimize for disk space or for sharing, or how inter-related those two goals are.

@MadLittleMods MadLittleMods added the A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db label May 18, 2023
@MadLittleMods MadLittleMods changed the title Re-use work of getting state for a given state_group Re-use work of getting state for a given state_group (_get_state_groups_from_groups) May 19, 2023
@clokep (Member) commented Sep 25, 2023

It sounds like this didn't make much of an improvement and significantly complicates the code. I'm going to close this due to that, but I did grab the nice improvements to the comments and move them to #16383.
